Exploratory Data Analysis of Joined Algae Clade & Bacteria in Coral Samples

read_csv("../output/coral_data.csv") -> corals
## Rows: 658 Columns: 5897
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr    (6): sample_id, Clade, Majority, species, region, reef
## dbl (5891): its2_count, ITS2_f, ASV0001, ASV0002, ASV0003, ASV0004, ASV0005,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#ITS2_f should be a factor
corals %>% 
  mutate(ITS2_f = as.factor(ITS2_f)) -> corals
dim(corals)
## [1]  658 5897

What are the proportions of observations per clade?

corals %>% 
  group_by(Clade) %>% 
  summarise(n = n()) 
## # A tibble: 2 × 2
##   Clade     n
##   <chr> <int>
## 1 A       606
## 2 C        52
  • Clade A has 606 observations. 52/(606+52) = about 8 %

  • Clade C has 52 observations. 606/(606+52) = about 92 %

Notes: If we are trying to explain the variability of clades based on bacteria, logistic regression would be used to explain or predict either clade A or C with bacteria ASVs as columns; however, taking out other info is no advisable yet as they are all possible preidctors.

  • How correlated are the top 2 ASVs with each of the algae variables?

  • Logging the ASV counts presents better visualization:

corals %>% 
  ggplot(aes(x = log(ASV0001), y = log(ASV0002), color = Clade)) +
  geom_point(position = "jitter") + 
  facet_wrap(~species)

  • Clade A is far more prevalent as is also shown in the number of abservations per Clade.

  • Distributions are different per species. ASV1 and 2 are more clustered for species P and more spread out in terms of counts for species S.

  • From coseq() K-means clustering, this is perhaps why ASV1 and 2 were consistently clustered together in species P but not for species S. Perhaps ASVs 1 and 2 play a different role in species S.

Data Subsetting of first 1000 most abundant ASVs per Species

Species P Data Subsetting

corals %>% 
  filter(species =="P") %>%  # 296:  cut off will be 293 inclusive
  select(-c(1:8)) -> corals_p

### These are the ASVs that are found in at least 3 samples or more for species P
map(corals_p, ~sum(.==0)) %>% 
  as_tibble() %>% 
  pivot_longer(cols = everything(), 
               names_to = "ASV_column", 
               values_to = "sum_col_eq_0") %>% 
  arrange(sum_col_eq_0) %>% 
  # cut off includes 293 or more because 296-293 = 3 
  filter(sum_col_eq_0 >=293) -> asvs_speciesP
  • Which ASVs are NOT found in Species P? ASV3
map(corals_p, ~sum(.==0)) %>% 
  as_tibble() %>% 
  pivot_longer(cols = everything(), 
               names_to = "ASV_column", 
               values_to = "sum_col_eq_0") %>% 
  # cut off includes 293 or more because 296-293 = 3 
  filter(sum_col_eq_0 == 0)
## # A tibble: 1 × 2
##   ASV_column sum_col_eq_0
##   <chr>             <int>
## 1 ASV0003               0
  • Take out ASV3 for species P data subset
# use original data frame to get other columns back into data subset
corals %>% 
  filter(species == "P") %>% 
  # take out asv 3 bc we know it's not found in this species
  # take out species column 
  select(-ASV0003, -species) %>% 
  group_by(Clade  ) %>% 
  summarise(n = n())
## # A tibble: 2 × 2
##   Clade     n
##   <chr> <int>
## 1 A       283
## 2 C        13
  • Note: 13/283 is about 5 %

Species S Data Subsetting

Visualization Experiment with t-SNE Plot:

  • t-SNE is a way to visualize high-dimensional data and is recommended for hundreds of thousands of ASVs. Client indicated that it is also possible to use first 1000 ASVs as these are already ranked in terms of the abundance they were found in the data.

Code adapted from here

library(Rtsne)
  • For t-SNE, select top 1000 ASVs no matter species to test out visualization
corals %>% 
  select(starts_with("ASV")[1:1000]) %>% 
  as.matrix() -> asv_matrix

Example t-SNE visualization

library(tsne)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
data("iris")

features <- subset(iris, select = -c(Species)) 

set.seed(0)
tsne <- tsne(features, initial_dims = 2)
## sigma summary: Min. : 0.376979658833158 |1st Qu. : 0.45299119244845 |Median : 0.509480199794486 |Mean : 0.520650714341092 |3rd Qu. : 0.579571467464058 |Max. : 0.758492715638686 |
## Epoch: Iteration #100 error is: 10.6877963361652
## Epoch: Iteration #200 error is: 0.0730593518613691
## Epoch: Iteration #300 error is: 0.0688210768935925
## Epoch: Iteration #400 error is: 0.0677845552661532
## Epoch: Iteration #500 error is: 0.0674575742993366
## Epoch: Iteration #600 error is: 0.0672632485595604
## Epoch: Iteration #700 error is: 0.0671387737166583
## Epoch: Iteration #800 error is: 0.0670575761754511
## Epoch: Iteration #900 error is: 0.0670020352301366
## Epoch: Iteration #1000 error is: 0.0669657774358675
tsne <- data.frame(tsne)
pdb <- cbind(tsne,iris$Species)
options(warn = -1)
fig <-  plot_ly(data = pdb ,x =  ~X1, y = ~X2, type = 'scatter', mode = 'markers', split = ~iris$Species)

fig <- fig %>%
  layout(
    plot_bgcolor = "#e5ecf6"
  )

fig
library(tsne)
library(plotly)


features <- subset(asv_matrix) 

set.seed(0)
tsne <- tsne(features, initial_dims = 2)
## sigma summary: Min. : 0.0717351762735381 |1st Qu. : 0.14417917387887 |Median : 0.25304682961265 |Mean : 0.259781738918223 |3rd Qu. : 0.331287316162124 |Max. : 0.881559216277751 |
## Epoch: Iteration #100 error is: 9.5805974100228
## Epoch: Iteration #200 error is: 0.215750400209934
## Epoch: Iteration #300 error is: 0.167603391129179
## Epoch: Iteration #400 error is: 0.152457463386035
## Epoch: Iteration #500 error is: 0.146676402504261
## Epoch: Iteration #600 error is: 0.143524659317524
## Epoch: Iteration #700 error is: 0.14148496677019
## Epoch: Iteration #800 error is: 0.140058690361463
## Epoch: Iteration #900 error is: 0.1390071568364
## Epoch: Iteration #1000 error is: 0.138210903902764
tsne <- data.frame(tsne)
pdb <- cbind(tsne,corals$species)
options(warn = -1)
fig <-  plot_ly(data = pdb ,x =  ~X1, y = ~X2, type = 'scatter', mode = 'markers', split = ~corals$species)

fig <- fig %>%
  layout(
    plot_bgcolor = "#e5ecf6"
  )

fig
library(tsne)
library(plotly)

features <- subset(asv_matrix) 

set.seed(0)
tsne <- tsne(features, initial_dims = 2)
## sigma summary: Min. : 0.0717351762735381 |1st Qu. : 0.14417917387887 |Median : 0.25304682961265 |Mean : 0.259781738918223 |3rd Qu. : 0.331287316162124 |Max. : 0.881559216277751 |
## Epoch: Iteration #100 error is: 9.5805974100228
## Epoch: Iteration #200 error is: 0.215750400209934
## Epoch: Iteration #300 error is: 0.167603391129179
## Epoch: Iteration #400 error is: 0.152457463386035
## Epoch: Iteration #500 error is: 0.146676402504261
## Epoch: Iteration #600 error is: 0.143524659317524
## Epoch: Iteration #700 error is: 0.14148496677019
## Epoch: Iteration #800 error is: 0.140058690361463
## Epoch: Iteration #900 error is: 0.1390071568364
## Epoch: Iteration #1000 error is: 0.138210903902764
tsne <- data.frame(tsne)
pdb <- cbind(tsne,corals$species)
options(warn = -1)
fig <-  plot_ly(data = pdb ,x =  ~X1, y = ~X2, type = 'scatter', mode = 'markers', split = ~corals$Clade)

fig <- fig %>%
  layout(
    plot_bgcolor = "#e5ecf6"
  )

fig

GGpairs on top 10 ASVs and Clade for Species P

corals[, -c(1,3,4,5,7,8) ] %>% 
  filter(species=="P") %>% 
  select(1, 3:12) %>% 
  GGally::ggpairs(aes(color = Clade)) -> ggpairs_species_P
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggsave( "../plots/ggpairs_species_P.png", ggpairs_species_P, width = 14)
## Saving 14 x 5 in image
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

GGpairs on top 10 ASVs and Clade for Species P

corals[, -c(1,3,4,5,7,8) ] %>% 
  filter(species=="S") %>% 
  select(1, 3:12) %>% 
  GGally::ggpairs(aes(color = Clade)) -> ggpairs_species_S
ggsave( "../plots/ggpairs_species_S.png", ggpairs_species_S, width = 14)
## Saving 14 x 5 in image
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.